E-commerce segmentation and recommender system¶
CSCA 5632 Unsupervised Algorithms in Machine Learning Final Project¶
Nicolas Veas
GITHUB REPOSITORY¶
https://github.com/nicovy/ecommerce
Introduction¶
Understanding the audience of an e-commerce platform is crucial for effective marketing, as it enables the use of personalized strategies. Developing a customer segmentation model allows us to understand the different profiles and preferences of the e-commerce customers. This enhances customer engagement and supports implementing a recommender system that could help improve the shopping experience by providing personalized product suggestions that a specific customer will likely purchase.
In this project, I'll use unsupervised learning models to build a customer segmentation and a recommender system.
The data comes from Kaggle:
https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis
Data¶
- The data set consists of 13 features and 55000 entries of customer transactions.
Objective¶
- To segment customers into distinct groups
- To build a recommender system that recommends products to customers
Data Dictionary¶
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
1. Initialization¶
1.1. Load Libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from yellowbrick.cluster import KElbowVisualizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
%matplotlib inline
sns.set_theme()
pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")
1.2. Load the data¶
data = pd.read_csv("ecommerce_dataset.csv")
# create a duplicate
df = data.copy()
2. Data Overview¶
2.1. Dataset Overview¶
df.head()
| CID | TID | Gender | Age Group | Purchase Date | Product Category | Discount Availed | Discount Name | Discount Amount (INR) | Gross Amount | Net Amount | Purchase Method | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 943146 | 5876328741 | Female | 25-45 | 30/08/2023 20:27:08 | Electronics | Yes | FESTIVE50 | 64.30 | 725.304000 | 661.004000 | Credit Card | Ahmedabad |
| 1 | 180079 | 1018503182 | Male | 25-45 | 23/02/2024 09:33:46 | Electronics | Yes | SEASONALOFFER21 | 175.19 | 4638.991875 | 4463.801875 | Credit Card | Bangalore |
| 2 | 337580 | 3814082218 | Other | 60 and above | 06/03/2022 09:09:50 | Clothing | Yes | SEASONALOFFER21 | 211.54 | 1986.372575 | 1774.832575 | Credit Card | Delhi |
| 3 | 180333 | 1395204173 | Other | 60 and above | 04/11/2020 04:41:57 | Sports & Fitness | No | NaN | 0.00 | 5695.612650 | 5695.612650 | Debit Card | Delhi |
| 4 | 447553 | 8009390577 | Male | 18-25 | 31/05/2022 17:00:32 | Sports & Fitness | Yes | WELCOME5 | 439.92 | 2292.651500 | 1852.731500 | Credit Card | Delhi |
display(df.info())
# data shape
display(df.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55000 entries, 0 to 54999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CID                    55000 non-null  int64  
 1   TID                    55000 non-null  int64  
 2   Gender                 55000 non-null  object 
 3   Age Group              55000 non-null  object 
 4   Purchase Date          55000 non-null  object 
 5   Product Category       55000 non-null  object 
 6   Discount Availed       55000 non-null  object 
 7   Discount Name          27415 non-null  object 
 8   Discount Amount (INR)  55000 non-null  float64
 9   Gross Amount           55000 non-null  float64
 10  Net Amount             55000 non-null  float64
 11  Purchase Method        55000 non-null  object 
 12  Location               55000 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 5.5+ MB
None
(55000, 13)
Comments:¶
Discount Name has null values because not all purchases were made with discount codes
55000 rows
13 columns
3 Numerical variables
9 Categorical variables
Discount Amount, Gross Amount and Net Amount are numerical variables.
CID, TID, Gender, Age Group, Purchase Date, Product Category, Discount Availed, Discount Name, Purchase Method and Location are categorical variables.
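The numeric/categorical split above can be confirmed with `select_dtypes`; a minimal sketch on a toy frame mirroring the dataset's schema (illustrative values only, not the real data):

```python
import pandas as pd

# Toy frame mirroring the dataset's schema (illustrative values only)
toy = pd.DataFrame({
    "CID": [943146, 180079],  # stored as int64, but conceptually an identifier
    "Gender": ["Female", "Male"],
    "Discount Amount (INR)": [64.30, 175.19],
    "Gross Amount": [725.30, 4638.99],
    "Net Amount": [661.00, 4463.80],
})

# Split columns by storage dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # note that CID shows up as numeric until it is recast
print(object_cols)
```

Note that CID (and TID) surface as numeric here even though they are identifiers, which is why they are recast in the data-cleaning section.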
2.2. Variables overview¶
for col in df.columns:
print(f"{col}: {df[col].nunique()} unique values")
CID: 29071 unique values
TID: 55000 unique values
Gender: 3 unique values
Age Group: 5 unique values
Purchase Date: 54988 unique values
Product Category: 9 unique values
Discount Availed: 2 unique values
Discount Name: 5 unique values
Discount Amount (INR): 20578 unique values
Gross Amount: 54813 unique values
Net Amount: 54938 unique values
Purchase Method: 8 unique values
Location: 14 unique values
# print unique values for columns with fewer than 15 unique values
for col in df.columns:
if df[col].nunique() < 15:
print(f"{col}: {df[col].unique()}")
Gender: ['Female' 'Male' 'Other']
Age Group: ['25-45' '60 and above' '18-25' '45-60' 'under 18']
Product Category: ['Electronics' 'Clothing' 'Sports & Fitness' 'Pet Care' 'Home & Kitchen' 'Books' 'Beauty and Health' 'Other' 'Toys & Games']
Discount Availed: ['Yes' 'No']
Discount Name: ['FESTIVE50' 'SEASONALOFFER21' nan 'WELCOME5' 'SAVE10' 'NEWYEARS']
Purchase Method: ['Credit Card' 'Debit Card' 'PhonePe UPI' 'Google Pay UPI' 'Net Banking' 'Cash on Delivery' 'Paytm UPI' 'International Card']
Location: ['Ahmedabad' 'Bangalore' 'Delhi' 'Other' 'Chennai' 'Dehradun' 'Pune' 'Hyderabad' 'Mumbai' 'Jaipur' 'Lucknow' 'Kolkata' 'Srinagar' 'Varanasi']
5 Different age groups
9 Product categories
8 Payment methods
The 14 locations are all Indian cities, plus an 'Other' category
2.3. Statistical summary¶
# Numerical
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CID | 55000.0 | 5.512456e+05 | 2.606033e+05 | 1.000090e+05 | 3.237170e+05 | 5.500885e+05 | 7.769558e+05 | 9.999960e+05 |
| TID | 55000.0 | 5.504740e+09 | 2.594534e+09 | 1.000163e+09 | 3.252604e+09 | 5.498383e+09 | 7.747933e+09 | 9.999393e+09 |
| Discount Amount (INR) | 55000.0 | 1.369868e+02 | 1.653755e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.741150e+02 | 5.000000e+02 |
| Gross Amount | 55000.0 | 3.012937e+03 | 1.718431e+03 | 1.364543e+02 | 1.562111e+03 | 2.954266e+03 | 4.342222e+03 | 8.394826e+03 |
| Net Amount | 55000.0 | 2.875950e+03 | 1.726128e+03 | -3.511198e+02 | 1.429552e+03 | 2.814911e+03 | 4.211408e+03 | 8.394826e+03 |
Mean discount of 136.99 vs a median of 0, which suggests the distribution is right-skewed (over half of the transactions had no discount)
Gross Amount and Net Amount are also right-skewed
There are negative values in Net Amount, which are likely data-entry errors
# Categorical
df.describe(include=["object", "category", "bool", 'datetime']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Gender | 55000 | 3 | Female | 18454 |
| Age Group | 55000 | 5 | 25-45 | 22010 |
| Purchase Date | 55000 | 54988 | 04/07/2022 11:45:29 | 2 |
| Product Category | 55000 | 9 | Electronics | 16574 |
| Discount Availed | 55000 | 2 | No | 27585 |
| Discount Name | 27415 | 5 | NEWYEARS | 8135 |
| Purchase Method | 55000 | 8 | Credit Card | 22096 |
| Location | 55000 | 14 | Mumbai | 11197 |
The data contains 29071 unique customers, with a maximum of 8 purchases per customer
Gender is divided into 3 categories, with Female the most frequent
Most customers are in the 25-45 age group
Electronics is the top product category
50.15% of the purchases were made without a discount
Credit Card is the preferred payment method among the 8 available
Purchases were made from 14 different locations
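Summary claims like these (maximum purchases per customer, share of undiscounted purchases) can be verified with one-liners; a sketch on a toy frame (illustrative values, not the real data):

```python
import pandas as pd

# Toy transactions: customer 1 buys three times, customer 2 once
toy = pd.DataFrame({
    "CID": [1, 1, 1, 2],
    "Discount Availed": ["Yes", "No", "No", "Yes"],
})

# Most purchases made by a single customer
max_purchases = toy["CID"].value_counts().max()
# Fraction of purchases made without a discount
no_discount_share = (toy["Discount Availed"] == "No").mean()
print(max_purchases, no_discount_share)
```

On the full dataset the same two expressions should reproduce the figures quoted above.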
3. Data Cleaning & Transformation¶
3.1. Correct data types¶
# CID and TID should be Categorical
df["CID"] = df["CID"].astype("object")
df["TID"] = df["TID"].astype("object")
df["Gender"] = df["Gender"].astype("category")
df["Age Group"] = df["Age Group"].astype("category")
df["Purchase Date"] = pd.to_datetime(df["Purchase Date"])
df["Product Category"] = df["Product Category"].astype("category")
df["Discount Availed"] = df["Discount Availed"].astype("category")
df["Discount Name"] = df["Discount Name"].astype("category")
df["Purchase Method"] = df["Purchase Method"].astype("category")
df["Location"] = df["Location"].astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55000 entries, 0 to 54999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   CID                    55000 non-null  object        
 1   TID                    55000 non-null  object        
 2   Gender                 55000 non-null  category      
 3   Age Group              55000 non-null  category      
 4   Purchase Date          55000 non-null  datetime64[ns]
 5   Product Category       55000 non-null  category      
 6   Discount Availed       55000 non-null  category      
 7   Discount Name          27415 non-null  category      
 8   Discount Amount (INR)  55000 non-null  float64       
 9   Gross Amount           55000 non-null  float64       
 10  Net Amount             55000 non-null  float64       
 11  Purchase Method        55000 non-null  category      
 12  Location               55000 non-null  category      
dtypes: category(7), datetime64[ns](1), float64(3), object(2)
memory usage: 2.9+ MB
3.2. Handle missing values¶
missing_data = df.isnull().sum()
print(missing_data)
CID                          0
TID                          0
Gender                       0
Age Group                    0
Purchase Date                0
Product Category             0
Discount Availed             0
Discount Name            27585
Discount Amount (INR)        0
Gross Amount                 0
Net Amount                   0
Purchase Method              0
Location                     0
dtype: int64
# Discount Name is the only column with missing values
df[df['Discount Name'].isna()]['Discount Availed'].unique()
['No'] Categories (2, object): ['No', 'Yes']
# Discount Name is missing exactly when Discount Availed is "No",
# so fill the gaps with a placeholder category "0" (meaning no discount)
if "0" not in df["Discount Name"].cat.categories:
    df["Discount Name"] = df["Discount Name"].cat.add_categories("0")
df["Discount Name"] = df["Discount Name"].fillna("0")
# print na count
print(f"{df['Discount Name'].isna().sum()} missing values in Discount Name")
df["Discount Name"].unique()
0 missing values in Discount Name
['FESTIVE50', 'SEASONALOFFER21', '0', 'WELCOME5', 'SAVE10', 'NEWYEARS'] Categories (6, object): ['FESTIVE50', 'NEWYEARS', 'SAVE10', 'SEASONALOFFER21', 'WELCOME5', '0']
3.3. Cleaning Duplicates¶
# print duplicated items
df[df.duplicated()]
| CID | TID | Gender | Age Group | Purchase Date | Product Category | Discount Availed | Discount Name | Discount Amount (INR) | Gross Amount | Net Amount | Purchase Method | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
No duplicate rows to remove
3.4. Adjusting negative Net values¶
# print negative Net Amount
df[df["Net Amount"] < 0]
| CID | TID | Gender | Age Group | Purchase Date | Product Category | Discount Availed | Discount Name | Discount Amount (INR) | Gross Amount | Net Amount | Purchase Method | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 311 | 565676 | 1356242586 | Male | 45-60 | 2020-05-03 20:14:16 | Beauty and Health | Yes | NEWYEARS | 469.58 | 279.258000 | -190.322000 | Debit Card | Mumbai |
| 383 | 101206 | 4082960786 | Female | 18-25 | 2020-05-05 07:02:36 | Electronics | Yes | NEWYEARS | 358.58 | 204.886500 | -153.693500 | Debit Card | Delhi |
| 398 | 270557 | 9264884678 | Male | 45-60 | 2021-03-06 12:54:09 | Home & Kitchen | Yes | WELCOME5 | 385.36 | 246.808100 | -138.551900 | Debit Card | Bangalore |
| 489 | 334639 | 1610899978 | Male | 18-25 | 2022-09-18 22:31:07 | Home & Kitchen | Yes | NEWYEARS | 427.33 | 348.448275 | -78.881725 | Credit Card | Delhi |
| 564 | 924584 | 8564435752 | Other | 25-45 | 2020-02-13 05:04:55 | Electronics | Yes | SAVE10 | 429.49 | 414.914325 | -14.575675 | Credit Card | Delhi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54271 | 654558 | 6092629329 | Male | 25-45 | 2023-02-19 10:08:25 | Electronics | Yes | SEASONALOFFER21 | 443.65 | 379.287000 | -64.363000 | Debit Card | Chennai |
| 54280 | 253300 | 9018344109 | Female | 60 and above | 2020-03-07 19:06:13 | Home & Kitchen | Yes | SEASONALOFFER21 | 305.25 | 178.402875 | -126.847125 | Paytm UPI | Ahmedabad |
| 54378 | 789978 | 1230560432 | Female | 25-45 | 2020-08-30 05:05:20 | Books | Yes | SAVE10 | 212.69 | 185.241000 | -27.449000 | PhonePe UPI | Mumbai |
| 54536 | 569063 | 8940735281 | Male | 45-60 | 2021-08-20 17:17:32 | Electronics | Yes | FESTIVE50 | 492.84 | 216.315000 | -276.525000 | Google Pay UPI | Lucknow |
| 54903 | 940867 | 8469174728 | Male | 25-45 | 2019-10-22 01:21:46 | Clothing | Yes | SEASONALOFFER21 | 362.14 | 247.819000 | -114.321000 | Debit Card | Kolkata |
613 rows × 13 columns
There are 613 entries with a negative net amount
Net Amount is Gross Amount minus the Discount Amount
If the discount exceeds the gross amount, the net amount should be floored at 0
# Replace negative Net Amount with 0
df.loc[df["Net Amount"] < 0, "Net Amount"] = 0
df[["Net Amount"]].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Net Amount | 55000.0 | 2877.175022 | 1724.027159 | 0.0 | 1429.551863 | 2814.910875 | 4211.407838 | 8394.8256 |
3.5. Adjusting discount amount¶
# print discount amount over gross amount
df[df["Discount Amount (INR)"] > df["Gross Amount"]]
| CID | TID | Gender | Age Group | Purchase Date | Product Category | Discount Availed | Discount Name | Discount Amount (INR) | Gross Amount | Net Amount | Purchase Method | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 311 | 565676 | 1356242586 | Male | 45-60 | 2020-05-03 20:14:16 | Beauty and Health | Yes | NEWYEARS | 469.58 | 279.258000 | 0.0 | Debit Card | Mumbai |
| 383 | 101206 | 4082960786 | Female | 18-25 | 2020-05-05 07:02:36 | Electronics | Yes | NEWYEARS | 358.58 | 204.886500 | 0.0 | Debit Card | Delhi |
| 398 | 270557 | 9264884678 | Male | 45-60 | 2021-03-06 12:54:09 | Home & Kitchen | Yes | WELCOME5 | 385.36 | 246.808100 | 0.0 | Debit Card | Bangalore |
| 489 | 334639 | 1610899978 | Male | 18-25 | 2022-09-18 22:31:07 | Home & Kitchen | Yes | NEWYEARS | 427.33 | 348.448275 | 0.0 | Credit Card | Delhi |
| 564 | 924584 | 8564435752 | Other | 25-45 | 2020-02-13 05:04:55 | Electronics | Yes | SAVE10 | 429.49 | 414.914325 | 0.0 | Credit Card | Delhi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54271 | 654558 | 6092629329 | Male | 25-45 | 2023-02-19 10:08:25 | Electronics | Yes | SEASONALOFFER21 | 443.65 | 379.287000 | 0.0 | Debit Card | Chennai |
| 54280 | 253300 | 9018344109 | Female | 60 and above | 2020-03-07 19:06:13 | Home & Kitchen | Yes | SEASONALOFFER21 | 305.25 | 178.402875 | 0.0 | Paytm UPI | Ahmedabad |
| 54378 | 789978 | 1230560432 | Female | 25-45 | 2020-08-30 05:05:20 | Books | Yes | SAVE10 | 212.69 | 185.241000 | 0.0 | PhonePe UPI | Mumbai |
| 54536 | 569063 | 8940735281 | Male | 45-60 | 2021-08-20 17:17:32 | Electronics | Yes | FESTIVE50 | 492.84 | 216.315000 | 0.0 | Google Pay UPI | Lucknow |
| 54903 | 940867 | 8469174728 | Male | 25-45 | 2019-10-22 01:21:46 | Clothing | Yes | SEASONALOFFER21 | 362.14 | 247.819000 | 0.0 | Debit Card | Kolkata |
613 rows × 13 columns
The discount amount should be less than or equal to the gross amount
# When discount amount is greater than gross amount, set discount amount to gross amount
df.loc[df["Discount Amount (INR)"] > df["Gross Amount"], "Discount Amount (INR)"] = df["Gross Amount"]
df[["Discount Amount (INR)", "Gross Amount", "Net Amount"]].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Discount Amount (INR) | 55000.0 | 135.761584 | 163.929401 | 0.000000 | 0.000000 | 0.000000 | 269.701875 | 500.0000 |
| Gross Amount | 55000.0 | 3012.936606 | 1718.431066 | 136.454325 | 1562.111325 | 2954.266150 | 4342.221675 | 8394.8256 |
| Net Amount | 55000.0 | 2877.175022 | 1724.027159 | 0.000000 | 1429.551863 | 2814.910875 | 4211.407838 | 8394.8256 |
3.6. Round amounts to 2 decimal places¶
# round off to 2 decimal places
df[["Discount Amount (INR)", "Gross Amount", "Net Amount"]] = df[["Discount Amount (INR)", "Gross Amount", "Net Amount"]].round(2)
df[["Discount Amount (INR)", "Gross Amount", "Net Amount"]].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Discount Amount (INR) | 55000.0 | 135.761584 | 163.929400 | 0.00 | 0.0000 | 0.000 | 269.7025 | 500.00 |
| Gross Amount | 55000.0 | 3012.936601 | 1718.431053 | 136.45 | 1562.1125 | 2954.265 | 4342.2250 | 8394.83 |
| Net Amount | 55000.0 | 2877.175018 | 1724.027151 | 0.00 | 1429.5550 | 2814.910 | 4211.4100 | 8394.83 |
4. Feature Engineering¶
For clustering, let's build a customer-level table with the following features:
- CID: the customer id
- gender: customer gender
- age_group: customer age group
- days_since_last_purchase: the number of days since the customer's last purchase
- total_transactions: the total number of transactions
- total_amount_spent: the total net amount spent across all transactions
- average_amount_spent: the average net amount spent per transaction
- purchases_with_discount: the number of purchases made with a discount
- total_discount_amount: the total discount across all transactions
- average_discount_amount: the average discount per transaction
- total_categories: count of unique product categories purchased
- favorite_product_category: the most frequently purchased category
- average_days_between_purchases: the average days between purchases
- favorite_purchase_method: the most frequent payment method
- favorite_location: the most frequent purchase location
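Many of these aggregates can alternatively be computed in a single pass with named aggregations instead of one merge per feature; a minimal sketch on a toy frame (column names match the cleaned df, values are illustrative):

```python
import pandas as pd

# Toy transactions for two customers (illustrative values only)
toy = pd.DataFrame({
    "CID": [1, 1, 1, 2],
    "TID": [10, 11, 12, 13],
    "Net Amount": [100.0, 200.0, 0.0, 50.0],
    "Discount Amount (INR)": [5.0, 0.0, 10.0, 0.0],
    "Product Category": ["Books", "Books", "Toys & Games", "Books"],
})

# One groupby with named aggregations builds several features at once
features = toy.groupby("CID").agg(
    total_transactions=("TID", "count"),
    total_amount_spent=("Net Amount", "sum"),
    average_amount_spent=("Net Amount", "mean"),
    total_discount_amount=("Discount Amount (INR)", "sum"),
    total_categories=("Product Category", "nunique"),
    favorite_product_category=("Product Category", lambda s: s.value_counts().index[0]),
).reset_index()
print(features)
```

Building everything from one groupby keeps all customers in a single frame and avoids the row loss that repeated inner merges can cause.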
# Calculate the days since the last purchase for each customer
# Extract only the date part of Purchase Date (already datetime)
df["purchase_day"] = df["Purchase Date"].dt.date
# Find the most recent purchase date for each customer
customer_data = df.groupby("CID")["purchase_day"].max().reset_index()
# Find the most recent date in the entire dataset
most_recent_date = df["purchase_day"].max()
# Convert purchase_day to datetime type before subtraction
customer_data["purchase_day"] = pd.to_datetime(customer_data["purchase_day"])
most_recent_date = pd.to_datetime(most_recent_date)
# Calculate the number of days since the last purchase for each customer
customer_data["days_since_last_purchase"] = (most_recent_date - customer_data["purchase_day"]).dt.days
# Remove the purchase_day column
df.drop(columns=["purchase_day"], inplace=True)
customer_data.drop(columns=["purchase_day"], inplace=True)
customer_data.head()
| CID | days_since_last_purchase | |
|---|---|---|
| 0 | 100009 | 287 |
| 1 | 100037 | 876 |
| 2 | 100063 | 1629 |
| 3 | 100089 | 262 |
| 4 | 100096 | 19 |
# add gender to customer_data
gender = df.groupby("CID")["Gender"].agg(lambda x: x.value_counts().index[0]).reset_index()
gender.columns = ["CID", "gender"]
customer_data = customer_data.merge(gender, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | |
|---|---|---|---|
| 0 | 100009 | 287 | Female |
| 1 | 100037 | 876 | Male |
| 2 | 100063 | 1629 | Female |
| 3 | 100089 | 262 | Other |
| 4 | 100096 | 19 | Female |
# add age_group to customer_data
age_group = df.groupby("CID")["Age Group"].agg(lambda x: x.value_counts().index[0]).reset_index()
age_group.columns = ["CID", "age_group"]
customer_data = customer_data.merge(age_group, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | |
|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 |
| 1 | 100037 | 876 | Male | 18-25 |
| 2 | 100063 | 1629 | Female | under 18 |
| 3 | 100089 | 262 | Other | 18-25 |
| 4 | 100096 | 19 | Female | 18-25 |
# Calculate the total transactions for each customer
total_transactions = df.groupby("CID")["TID"].count().reset_index()
total_transactions.columns = ["CID", "total_transactions"]
customer_data = customer_data.merge(total_transactions, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | |
|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 |
| 1 | 100037 | 876 | Male | 18-25 | 1 |
| 2 | 100063 | 1629 | Female | under 18 | 1 |
| 3 | 100089 | 262 | Other | 18-25 | 4 |
| 4 | 100096 | 19 | Female | 18-25 | 1 |
# Calculate the total amount spent by each customer
total_amount_spent = df.groupby("CID")["Net Amount"].sum().reset_index()
total_amount_spent.columns = ["CID", "total_amount_spent"]
customer_data = customer_data.merge(total_amount_spent, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | |
|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 |
| 2 | 100063 | 1629 | Female | under 18 | 1 | 522.90 |
| 3 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 |
| 4 | 100096 | 19 | Female | 18-25 | 1 | 4320.68 |
# Calculate the average amount spent by each customer
average_amount_spent = df.groupby("CID")["Net Amount"].mean().reset_index()
average_amount_spent.columns = ["CID", "average_amount_spent"]
customer_data = customer_data.merge(average_amount_spent, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | |
|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.820 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.990 |
| 2 | 100063 | 1629 | Female | under 18 | 1 | 522.90 | 522.900 |
| 3 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.935 |
| 4 | 100096 | 19 | Female | 18-25 | 1 | 4320.68 | 4320.680 |
# Calculate the count of purchases with discount for each customer
purchases_with_discount = df[df["Discount Availed"] == "Yes"].groupby("CID")["TID"].count().reset_index()
purchases_with_discount.columns = ["CID", "purchases_with_discount"]
# Note: this inner merge keeps only customers with at least one discounted purchase,
# shrinking the customer table from 29071 to 19478 rows
customer_data = customer_data.merge(purchases_with_discount, on="CID")
# Calculate the total discount amount used by each customer
total_discount_amount = df.groupby("CID")["Discount Amount (INR)"].sum().reset_index()
total_discount_amount.columns = ["CID", "total_discount_amount"]
customer_data = customer_data.merge(total_discount_amount, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 |
# Calculate the average discount amount used by each customer
average_discount_amount = df.groupby("CID")["Discount Amount (INR)"].mean().reset_index()
average_discount_amount.columns = ["CID", "average_discount_amount"]
customer_data = customer_data.merge(average_discount_amount, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 |
# Calculate the unique categories purchased by each customer
unique_categories = df.groupby("CID")["Product Category"].nunique().reset_index()
unique_categories.columns = ["CID", "total_categories"]
customer_data = customer_data.merge(unique_categories, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 |
# calculates the favorite product category for each customer
favorite_product_category = df.groupby("CID")["Product Category"].agg(lambda x: x.value_counts().index[0]).reset_index()
favorite_product_category.columns = ["CID", "favorite_product_category"]
customer_data = customer_data.merge(favorite_product_category, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | favorite_product_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | Electronics |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | Electronics |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | Beauty and Health |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | Beauty and Health |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | Clothing |
# calculates the average days between purchases for each customer
# Find the purchase dates for each customer
purchase_dates = df.groupby("CID")["Purchase Date"].unique().reset_index()
# Find the average days between purchases for each customer
purchase_dates["average_days_between_purchases"] = purchase_dates["Purchase Date"].apply(
lambda x: abs( np.mean(np.diff(x).astype("timedelta64[D]").astype(int)))
)
# Remove the Purchase Date column
purchase_dates.drop(columns=["Purchase Date"], inplace=True)
purchase_dates.fillna(0, inplace=True)
customer_data = customer_data.merge(purchase_dates, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | favorite_product_category | average_days_between_purchases | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | Electronics | 0.000000 |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | Electronics | 0.000000 |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | Beauty and Health | 83.000000 |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | Beauty and Health | 176.666667 |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | Clothing | 608.000000 |
# calculates the favorite purchase method for each customer
favorite_purchase_method = df.groupby("CID")["Purchase Method"].agg(lambda x: x.value_counts().index[0]).reset_index()
favorite_purchase_method.columns = ["CID", "favorite_purchase_method"]
customer_data = customer_data.merge(favorite_purchase_method, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | favorite_product_category | average_days_between_purchases | favorite_purchase_method | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | Electronics | 0.000000 | Credit Card |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | Electronics | 0.000000 | Credit Card |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | Beauty and Health | 83.000000 | Credit Card |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | Beauty and Health | 176.666667 | Cash on Delivery |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | Clothing | 608.000000 | Credit Card |
# calculates the favorite location for each customer
favorite_location = df.groupby("CID")["Location"].agg(lambda x: x.value_counts().index[0]).reset_index()
favorite_location.columns = ["CID", "favorite_location"]
customer_data = customer_data.merge(favorite_location, on="CID")
customer_data.head()
| CID | days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | favorite_product_category | average_days_between_purchases | favorite_purchase_method | favorite_location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | Electronics | 0.000000 | Credit Card | Delhi |
| 1 | 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | Electronics | 0.000000 | Credit Card | Hyderabad |
| 2 | 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | Beauty and Health | 83.000000 | Credit Card | Mumbai |
| 3 | 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | Beauty and Health | 176.666667 | Cash on Delivery | Ahmedabad |
| 4 | 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | Clothing | 608.000000 | Credit Card | Chennai |
#print final table info
customer_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19478 entries, 100009 to 999996
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   days_since_last_purchase        19478 non-null  int64
 1   gender                          19478 non-null  category
 2   age_group                       19478 non-null  category
 3   total_transactions              19478 non-null  int64
 4   total_amount_spent              19478 non-null  float64
 5   average_amount_spent            19478 non-null  float64
 6   purchases_with_discount         19478 non-null  int64
 7   total_discount_amount           19478 non-null  float64
 8   average_discount_amount         19478 non-null  float64
 9   total_categories                19478 non-null  int64
 10  favorite_product_category       19478 non-null  category
 11  average_days_between_purchases  19478 non-null  float64
 12  favorite_purchase_method        19478 non-null  category
 13  favorite_location               19478 non-null  category
 14  cluster                         19478 non-null  int32
dtypes: category(5), float64(5), int32(1), int64(4)
memory usage: 1.7 MB
4. EDA¶
4.1. Correlation Analysis¶
4.1.1. Purchases Correlation¶
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="coolwarm", vmin=-1, vmax=1)
<Axes: >
Gross and net amounts are highly correlated, as expected
Discount amount shows very low correlation with the gross and net amounts
4.1.2. Customer Correlation¶
plt.figure(figsize=(10, 5))
sns.heatmap(customer_data.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="coolwarm", vmin=-1, vmax=1)
<Axes: >
Total categories is highly correlated with total transactions, which means customers usually shop across different categories
Total transactions is also highly correlated with the total amount spent
Average discount and total discount are highly correlated
Total amount spent and total categories are highly correlated
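Rather than reading the strongest relationships off the heatmap by eye, the top correlated pairs can be listed programmatically. A minimal sketch on synthetic stand-in data (in the notebook, `customer_data` would take the place of `frame`):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one deliberately correlated pair of columns
rng = np.random.default_rng(0)
total = rng.uniform(100, 1000, 50)
frame = pd.DataFrame({
    "total_amount_spent": total,
    "average_amount_spent": total / 2 + rng.normal(0, 5, 50),  # nearly linear in total
    "days_since_last_purchase": rng.integers(0, 900, 50),
})

corr = frame.corr(numeric_only=True)
# keep each pair once: take the upper triangle, drop the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]  # tuple holding the two most correlated columns
```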
4.2. Pairplot¶
4.2.1. Purchases pairplot¶
sns.pairplot(df, diag_kind="kde");
Gross and net amounts are highly correlated, with a clear linear relationship
For the discount amount we can identify two distinct groups: purchases with zero discount and purchases with a discount
4.2.2 Customer data pairplot¶
sns.pairplot(customer_data, diag_kind="kde");
Customers with more transactions tend to buy more often
We can see a linear relationship between total_amount_spent and average_amount_spent, and between total_discount_amount and average_discount_amount
4.3 Categorical variables¶
# function to plot a countplot with count and percentage labels on each bar
def plot_countplot(data, x, size, title=None):
    total = float(len(data))
    if size is None:
        plt.figure()
    else:
        plt.figure(figsize=size)
    ax = sns.countplot(data, x=x)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2.0, height + 3, f"{height / total:.2%}\n({height})", ha="center")
    plt.title(title)
    plt.show()
plot_countplot(customer_data, 'gender', None, 'Gender')
Most of the customers are female
plot_countplot(customer_data, "age_group", None, "Age Group")
~80% of customers are between 18 and 45 years old
Very few customers are over 60 or under 18
plot_countplot(customer_data, "favorite_product_category", (10, 5), "Favorite Product Category")
The most popular categories are Electronics, Clothing, and Beauty and Health (~75% combined)
Pet Care and Toys & Games account for very few customers
plot_countplot(customer_data, "favorite_purchase_method", (10, 5), "Favorite Purchase Method")
Most of the payments are made using credit cards (~74%)
The second most popular payment method is debit card
plot_countplot(customer_data, "favorite_location", (12, 5), "Favorite Location")
About 57% of customers are from Bangalore, Delhi, and Mumbai
Dehradun, Srinagar, Varanasi, and Other each account for under 1% of customers
5. Data Preparation¶
5.1. Set CID as index¶
customer_data.set_index("CID", inplace=True)
5.2. One hot encode for categorical¶
# Getting Dummies for the categorical variables.
customer_data_ohe = pd.get_dummies(customer_data, drop_first=True, dtype=np.uint8)
display(customer_data_ohe.head())
customer_data_ohe.info()
| days_since_last_purchase | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | average_days_between_purchases | gender_Male | gender_Other | age_group_25-45 | age_group_45-60 | age_group_60 and above | age_group_under 18 | favorite_product_category_Books | favorite_product_category_Clothing | favorite_product_category_Electronics | favorite_product_category_Home & Kitchen | favorite_product_category_Other | favorite_product_category_Pet Care | favorite_product_category_Sports & Fitness | favorite_product_category_Toys & Games | favorite_purchase_method_Credit Card | favorite_purchase_method_Debit Card | favorite_purchase_method_Google Pay UPI | favorite_purchase_method_International Card | favorite_purchase_method_Net Banking | favorite_purchase_method_Paytm UPI | favorite_purchase_method_PhonePe UPI | favorite_location_Bangalore | favorite_location_Chennai | favorite_location_Dehradun | favorite_location_Delhi | favorite_location_Hyderabad | favorite_location_Jaipur | favorite_location_Kolkata | favorite_location_Lucknow | favorite_location_Mumbai | favorite_location_Other | favorite_location_Pune | favorite_location_Srinagar | favorite_location_Varanasi | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CID | |||||||||||||||||||||||||||||||||||||||||||
| 100009 | 287 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | 0.000000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100037 | 876 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | 0.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100089 | 262 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | 83.000000 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 100205 | 701 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | 176.666667 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100258 | 98 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | 608.000000 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
<class 'pandas.core.frame.DataFrame'>
Index: 19478 entries, 100009 to 999996
Data columns (total 43 columns):
 #   Column                                       Non-Null Count  Dtype
---  ------                                       --------------  -----
 0   days_since_last_purchase                     19478 non-null  int64
 1   total_transactions                           19478 non-null  int64
 2   total_amount_spent                           19478 non-null  float64
 3   average_amount_spent                         19478 non-null  float64
 4   purchases_with_discount                      19478 non-null  int64
 5   total_discount_amount                        19478 non-null  float64
 6   average_discount_amount                      19478 non-null  float64
 7   total_categories                             19478 non-null  int64
 8   average_days_between_purchases               19478 non-null  float64
 9   gender_Male                                  19478 non-null  uint8
 10  gender_Other                                 19478 non-null  uint8
 11  age_group_25-45                              19478 non-null  uint8
 12  age_group_45-60                              19478 non-null  uint8
 13  age_group_60 and above                       19478 non-null  uint8
 14  age_group_under 18                           19478 non-null  uint8
 15  favorite_product_category_Books              19478 non-null  uint8
 16  favorite_product_category_Clothing           19478 non-null  uint8
 17  favorite_product_category_Electronics        19478 non-null  uint8
 18  favorite_product_category_Home & Kitchen     19478 non-null  uint8
 19  favorite_product_category_Other              19478 non-null  uint8
 20  favorite_product_category_Pet Care           19478 non-null  uint8
 21  favorite_product_category_Sports & Fitness   19478 non-null  uint8
 22  favorite_product_category_Toys & Games       19478 non-null  uint8
 23  favorite_purchase_method_Credit Card         19478 non-null  uint8
 24  favorite_purchase_method_Debit Card          19478 non-null  uint8
 25  favorite_purchase_method_Google Pay UPI      19478 non-null  uint8
 26  favorite_purchase_method_International Card  19478 non-null  uint8
 27  favorite_purchase_method_Net Banking         19478 non-null  uint8
 28  favorite_purchase_method_Paytm UPI           19478 non-null  uint8
 29  favorite_purchase_method_PhonePe UPI         19478 non-null  uint8
 30  favorite_location_Bangalore                  19478 non-null  uint8
 31  favorite_location_Chennai                    19478 non-null  uint8
 32  favorite_location_Dehradun                   19478 non-null  uint8
 33  favorite_location_Delhi                      19478 non-null  uint8
 34  favorite_location_Hyderabad                  19478 non-null  uint8
 35  favorite_location_Jaipur                     19478 non-null  uint8
 36  favorite_location_Kolkata                    19478 non-null  uint8
 37  favorite_location_Lucknow                    19478 non-null  uint8
 38  favorite_location_Mumbai                     19478 non-null  uint8
 39  favorite_location_Other                      19478 non-null  uint8
 40  favorite_location_Pune                       19478 non-null  uint8
 41  favorite_location_Srinagar                   19478 non-null  uint8
 42  favorite_location_Varanasi                   19478 non-null  uint8
dtypes: float64(5), int64(4), uint8(34)
memory usage: 2.1 MB
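As a minimal sketch of what `drop_first=True` does (a toy column, not the project data): one level of each categorical becomes the implicit baseline, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

# Toy categorical column
toy = pd.DataFrame({"gender": ["Female", "Male", "Other", "Female"]})

# With drop_first=True the first level ("Female", alphabetically) is
# dropped; a row of all zeros therefore encodes that baseline level.
dummies = pd.get_dummies(toy, drop_first=True, dtype="uint8")
```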
5.3. PCA¶
pca = PCA().fit(customer_data_ohe)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
plt.figure(figsize=(20, 7))
barplot = sns.barplot(x=list(range(1, len(cumulative_explained_variance) + 1)), y=explained_variance_ratio, alpha=0.8)
(lineplot,) = plt.plot(
range(0, len(cumulative_explained_variance)), cumulative_explained_variance, marker="o", linestyle="--", color="#003cff", linewidth=2
)
plt.xlabel("Number of Components", fontsize=14)
plt.ylabel("Explained Variance", fontsize=14)
plt.title("Cumulative Variance vs. Number of Components", fontsize=18)
x_offset = -0.3
y_offset = 0.01
for i, (ev_ratio, cum_ev_ratio) in enumerate(zip(explained_variance_ratio, cumulative_explained_variance)):
plt.text(i, ev_ratio, f"{ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
if i > 0:
plt.text(i + x_offset, cum_ev_ratio + y_offset, f"{cum_ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
plt.show()
We will choose 3 components, as they explain 99% of the cumulative variance. (Note that the features are not standardized, so the high-variance monetary columns dominate the first components.)
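The component count can also be picked programmatically as the first point where the cumulative explained-variance ratio crosses the threshold. A sketch on synthetic data with a few dominant directions (on the project data this yields 3 components at the 0.99 threshold):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 features with very different scales, so a couple of
# directions dominate the variance (mimicking the unscaled monetary columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([100.0, 50.0, 1.0, 1.0, 0.5, 0.1])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# index of the first component where cumulative variance reaches 0.99
n_components = int(np.argmax(cumulative >= 0.99)) + 1
```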
customer_data_pca = PCA(n_components=3).fit_transform(customer_data_ohe)
customer_data_pca = pd.DataFrame(customer_data_pca, columns=["PC" + str(i + 1) for i in range(3)])
customer_data_pca.index = customer_data_ohe.index
customer_data_pca.head()
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| CID | |||
| 100009 | 964.712058 | 3350.665445 | -442.421111 |
| 100037 | -2778.402804 | 917.955536 | 150.145047 |
| 100089 | 6116.099049 | -965.410541 | 111.787897 |
| 100205 | 1448.167445 | -1173.823931 | 274.310040 |
| 100258 | 2555.517855 | -494.327297 | -486.139511 |
6. K-Means Modeling¶
6.1. Determine Number of Clusters¶
Let's use the elbow method
km = KMeans(init="k-means++", n_init=10, max_iter=100, random_state=0)
fig, ax = plt.subplots(figsize=(12, 5))
visualizer = KElbowVisualizer(km, k=(1, 10), timings=False, ax=ax)
visualizer.fit(customer_data_pca)
visualizer.show();
The elbow method suggests an optimal number of 3 clusters
6.2 KMeans Clustering model¶
km = KMeans(init="k-means++", n_init=10, max_iter=100, random_state=0, n_clusters=3)
km.fit(customer_data_pca)
# Append the cluster labels to the original dataframes
customer_data["cluster"] = km.labels_
customer_data_ohe["cluster"] = km.labels_
customer_data_pca["cluster"] = km.labels_
customer_data.head()
| days_since_last_purchase | gender | age_group | total_transactions | total_amount_spent | average_amount_spent | purchases_with_discount | total_discount_amount | average_discount_amount | total_categories | favorite_product_category | average_days_between_purchases | favorite_purchase_method | favorite_location | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CID | |||||||||||||||
| 100009 | 287 | Female | 25-45 | 1 | 6310.82 | 6310.8200 | 1 | 343.46 | 343.460000 | 1 | Electronics | 0.000000 | Credit Card | Delhi | 0 |
| 100037 | 876 | Male | 18-25 | 1 | 3147.99 | 3147.9900 | 1 | 425.15 | 425.150000 | 1 | Electronics | 0.000000 | Credit Card | Hyderabad | 2 |
| 100089 | 262 | Other | 18-25 | 4 | 12219.74 | 3054.9350 | 2 | 453.00 | 113.250000 | 4 | Beauty and Health | 83.000000 | Credit Card | Mumbai | 1 |
| 100205 | 701 | Female | 25-45 | 4 | 7696.69 | 1924.1725 | 2 | 499.37 | 124.842500 | 4 | Beauty and Health | 176.666667 | Cash on Delivery | Ahmedabad | 0 |
| 100258 | 98 | Other | 25-45 | 3 | 8608.02 | 2869.3400 | 2 | 499.24 | 166.413333 | 3 | Clothing | 608.000000 | Credit Card | Chennai | 0 |
6.3. Cluster Analysis¶
6.3.1. Distribution¶
plot_countplot(customer_data, "cluster", (10, 5), "Cluster Distribution")
Clusters 0 and 2 contain comparable numbers of customers
Cluster 1 has the fewest customers (3166)
6.3.2. Cluster profiles¶
# print statistics for each cluster
print("Cluster 0")
display(customer_data[customer_data["cluster"] == 0].describe().T)
print("Cluster 1")
display(customer_data[customer_data["cluster"] == 1].describe().T)
print("Cluster 2")
display(customer_data[customer_data["cluster"] == 2].describe().T)
Cluster 0
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| days_since_last_purchase | 8549.0 | 596.552111 | 441.318714 | 0.000000 | 232.000 | 501.000000 | 895.00 | 1825.00 |
| total_transactions | 8549.0 | 2.201427 | 0.839420 | 1.000000 | 2.000 | 2.000000 | 3.00 | 6.00 |
| total_amount_spent | 8549.0 | 6712.037189 | 1587.286547 | 4017.820000 | 5345.300 | 6547.610000 | 7980.44 | 9956.45 |
| average_amount_spent | 8549.0 | 3375.940286 | 1101.009197 | 897.986667 | 2545.505 | 3128.943333 | 4183.22 | 8097.20 |
| purchases_with_discount | 8549.0 | 1.417593 | 0.623513 | 1.000000 | 1.000 | 1.000000 | 2.00 | 5.00 |
| total_discount_amount | 8549.0 | 382.465446 | 231.075577 | 50.090000 | 209.340 | 355.590000 | 486.14 | 2144.86 |
| average_discount_amount | 8549.0 | 186.150398 | 107.362983 | 13.207500 | 102.675 | 169.975000 | 244.11 | 499.71 |
| total_categories | 8549.0 | 1.917768 | 0.729035 | 1.000000 | 1.000 | 2.000000 | 2.00 | 5.00 |
| average_days_between_purchases | 8549.0 | 386.477711 | 393.049654 | 0.000000 | 47.000 | 274.000000 | 598.00 | 1799.00 |
| cluster | 8549.0 | 0.000000 | 0.000000 | 0.000000 | 0.000 | 0.000000 | 0.00 | 0.00 |
Cluster 1
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| days_since_last_purchase | 3166.0 | 385.468099 | 311.466631 | 0.000000 | 140.000000 | 307.0000 | 571.750000 | 1760.000 |
| total_transactions | 3166.0 | 3.642135 | 1.059902 | 2.000000 | 3.000000 | 3.0000 | 4.000000 | 8.000 |
| total_amount_spent | 3166.0 | 13096.474912 | 2957.425130 | 9826.310000 | 10894.320000 | 12241.7050 | 14415.855000 | 27966.340 |
| average_amount_spent | 3166.0 | 3765.268475 | 877.124333 | 1700.631667 | 3191.727375 | 3653.7100 | 4210.005833 | 7068.925 |
| purchases_with_discount | 3166.0 | 1.964624 | 0.950603 | 1.000000 | 1.000000 | 2.0000 | 3.000000 | 7.000 |
| total_discount_amount | 3166.0 | 535.171649 | 309.945559 | 50.680000 | 303.812500 | 479.7800 | 724.000000 | 2195.370 |
| average_discount_amount | 3166.0 | 149.337923 | 78.925039 | 10.868000 | 88.889583 | 142.4575 | 199.119375 | 481.635 |
| total_categories | 3166.0 | 2.847126 | 0.871528 | 1.000000 | 2.000000 | 3.0000 | 3.000000 | 7.000 |
| average_days_between_purchases | 3166.0 | 279.055559 | 250.433437 | 0.000000 | 96.000000 | 213.7500 | 390.562500 | 1661.000 |
| cluster | 3166.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0000 | 1.000000 | 1.000 |
Cluster 2
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| days_since_last_purchase | 7763.0 | 821.256859 | 523.479049 | 0.0000 | 362.000000 | 788.000 | 1259.0000 | 1826.00 |
| total_transactions | 7763.0 | 1.445962 | 0.616914 | 1.0000 | 1.000000 | 1.000 | 2.0000 | 5.00 |
| total_amount_spent | 7763.0 | 2422.972029 | 1295.232631 | 0.0000 | 1366.460000 | 2529.300 | 3521.2100 | 4977.71 |
| average_amount_spent | 7763.0 | 1783.638903 | 1026.171580 | 0.0000 | 1026.396667 | 1696.270 | 2363.3550 | 4063.84 |
| purchases_with_discount | 7763.0 | 1.169136 | 0.405280 | 1.0000 | 1.000000 | 1.000 | 1.0000 | 4.00 |
| total_discount_amount | 7763.0 | 322.405843 | 177.763813 | 50.0100 | 185.290000 | 308.230 | 430.7400 | 1341.79 |
| average_discount_amount | 7763.0 | 241.459791 | 126.841281 | 16.5525 | 136.755000 | 225.295 | 342.2675 | 499.58 |
| total_categories | 7763.0 | 1.356306 | 0.546057 | 1.0000 | 1.000000 | 1.000 | 2.0000 | 4.00 |
| average_days_between_purchases | 7763.0 | 215.066823 | 375.528855 | 0.0000 | 0.000000 | 0.000 | 310.0000 | 1819.00 |
| cluster | 7763.0 | 2.000000 | 0.000000 | 2.0000 | 2.000000 | 2.000 | 2.0000 | 2.00 |
# plot the distribution of each feature per cluster:
# countplots for categorical features, boxplots for numeric ones
for column in customer_data.columns:
    if column == "cluster":
        continue
    if isinstance(customer_data[column].iloc[0], str):
        fig, ax = plt.subplots(1, 3, figsize=(15, 3))
        sns.countplot(customer_data[customer_data["cluster"] == 0], x=column, ax=ax[0])
        sns.countplot(customer_data[customer_data["cluster"] == 1], x=column, ax=ax[1])
        sns.countplot(customer_data[customer_data["cluster"] == 2], x=column, ax=ax[2])
        ax[0].set_title("Cluster 0")
        ax[1].set_title("Cluster 1")
        ax[2].set_title("Cluster 2")
    else:
        plt.figure(figsize=(5, 3))
        sns.boxplot(customer_data, y=column, x="cluster")
    plt.show()
The 3 clusters have similar distributions for gender, age group, favorite product category, favorite purchase method, and favorite location.
Cluster 0 - Casual buyers:
Customers who usually buy more than once
All age ranges
Total spend between 4000 and 10000
They spend between 2500 and 5000 per purchase
Days between purchases is mostly under 1000
Cluster 1 - Recurrent buyers:
Customers with many purchases
Ages from 18 to 60
Total spend of roughly 10000 or more
Most of their purchases use a discount
They spend between 2500 and 5000 per purchase
Time between purchases is mostly under 500 days
Cluster 2 - One-time buyers:
Customers who buy only once or twice
All age ranges
Spend under 5000 in total
Spend under 4100 per purchase
Have the highest average discount amount
Buy from one or two product categories
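The per-cluster profiles above can also be condensed into a single table with one groupby call. A sketch on a toy frame with hypothetical values that reuses the customer_data column names:

```python
import pandas as pd

# Toy frame: two customers per cluster
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "total_transactions": [2, 3, 4, 4, 1, 1],
    "total_amount_spent": [6500.0, 7000.0, 13000.0, 12500.0, 2500.0, 2300.0],
})

# one row per cluster, with the mean of every numeric feature
profile = toy.groupby("cluster").mean()
```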
7. Recommender system¶
Let's use matrix factorization (NMF) to build the recommender system
7.1. Prepare the data¶
Since amount spent was the most differentiating feature in the clustering, we will split each product category into three price buckets based on the net amount of the purchase:
<category>_0 -> net amount under 2500
<category>_2500 -> net amount between 2500 and 3200
<category>_3200 -> net amount of 3200 or more
The user-item matrix then counts purchases per customer and bucket, and looks like this:
CID Electronics_0 Electronics_2500 Electronics_3200 Books_0 ...
1 1 0 0 0 ...
2 0 2 0 0 ...
3 3 1 0 0 ...
.. ... ... ... ... ...
# duplicate df
df_temp = df.copy()
# make sure Product Category is a plain string so we can append suffixes
df_temp["Product Category"] = df_temp["Product Category"].astype(str)
# append a price-bucket suffix to the category based on the net amount
df_temp["Product Category"] = np.where(df_temp["Net Amount"] < 2500, df_temp["Product Category"] + "_0", df_temp["Product Category"])
df_temp["Product Category"] = np.where(
    (df_temp["Net Amount"] >= 2500) & (df_temp["Net Amount"] < 3200), df_temp["Product Category"] + "_2500", df_temp["Product Category"]
)
df_temp["Product Category"] = np.where(
    (df_temp["Net Amount"] >= 3200), df_temp["Product Category"] + "_3200", df_temp["Product Category"]
)
matrix = df_temp.pivot_table(index="CID", columns="Product Category", values="Net Amount", aggfunc="count", fill_value=0)
# binarize the counts: a cell is 1 if the customer made at least one
# purchase in that category/price bucket, 0 otherwise
matrix = (matrix > 0).astype(int)
# print the matrix
matrix.head()
| Product Category | Beauty and Health_0 | Beauty and Health_2500 | Beauty and Health_3200 | Books_0 | Books_2500 | Books_3200 | Clothing_0 | Clothing_2500 | Clothing_3200 | Electronics_0 | Electronics_2500 | Electronics_3200 | Home & Kitchen_0 | Home & Kitchen_2500 | Home & Kitchen_3200 | Other_0 | Other_2500 | Other_3200 | Pet Care_0 | Pet Care_2500 | Pet Care_3200 | Sports & Fitness_0 | Sports & Fitness_2500 | Sports & Fitness_3200 | Toys & Games_0 | Toys & Games_2500 | Toys & Games_3200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CID | |||||||||||||||||||||||||||
| 100009 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100037 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100063 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100089 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100096 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
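As a side note, the chain of np.where calls used to build the price-bucket suffixes could also be expressed with a single pd.cut; a sketch with toy net amounts:

```python
import pandas as pd

# Toy net amounts covering all three bands
net = pd.Series([1200.0, 2700.0, 5000.0, 3199.99])

# right=False makes the bins [-inf, 2500), [2500, 3200), [3200, inf),
# matching the < / >= comparisons used above
suffix = pd.cut(
    net,
    bins=[-float("inf"), 2500, 3200, float("inf")],
    right=False,
    labels=["_0", "_2500", "_3200"],
)
categories = "Electronics" + suffix.astype(str)
```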
7.3 Build matrix factorization model¶
from sklearn.decomposition import NMF
# Create an NMF model
nmf = NMF(n_components=10, init="random", random_state=0)
W = nmf.fit_transform(matrix)
H = nmf.components_
V = W @ H
7.5 Calculate RMSE¶
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(matrix, V)
print("RMSE:", rmse)
RMSE: 0.10788207134618165
We obtained an RMSE of 0.1079
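The reconstruction-error check can be reproduced on a small example directly from the definition of RMSE, without sklearn's metric helper; a sketch on a tiny non-negative matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# Tiny binary interaction matrix (4 users x 3 items)
M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

# rank-2 factorization: M ~ W @ H with non-negative factors
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = nmf.fit_transform(M)
H = nmf.components_

# RMSE between the original matrix and its reconstruction
rmse = np.sqrt(np.mean((M - W @ H) ** 2))
```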
7.6. Hyperparameter tuning¶
# try different numbers of components and keep the best one
best_model = None
best_rmse = float("inf")  # running best reconstruction error
for i in range(1, 40):
    try:
        nmf = NMF(n_components=i, init="random", random_state=0)
        W = nmf.fit_transform(matrix)
        H = nmf.components_
        V = W @ H
        rmse = root_mean_squared_error(matrix, V)
        if rmse < best_rmse:
            best_rmse = rmse
            best_model = W
            print("Best RMSE so far:", rmse, "n_components:", i)
    except Exception:
        continue
Best RMSE so far: 0.2168273984544131 n_components: 1
Best RMSE so far: 0.20211507943063522 n_components: 2
Best RMSE so far: 0.18787689407365077 n_components: 3
Best RMSE so far: 0.17421905825865078 n_components: 4
Best RMSE so far: 0.16209494214953132 n_components: 5
Best RMSE so far: 0.14797761287088693 n_components: 6
Best RMSE so far: 0.137406625599683 n_components: 7
Best RMSE so far: 0.1309429237823241 n_components: 8
Best RMSE so far: 0.11883297609958395 n_components: 9
Best RMSE so far: 0.10788207134618165 n_components: 10
Best RMSE so far: 0.09445696146083804 n_components: 11
Best RMSE so far: 0.08620328235067205 n_components: 12
Best RMSE so far: 0.07928958461533257 n_components: 13
Best RMSE so far: 0.07132522465338667 n_components: 14
Best RMSE so far: 0.06441358406793622 n_components: 15
Best RMSE so far: 0.05844276395140894 n_components: 16
Best RMSE so far: 0.050453159397099635 n_components: 17
Best RMSE so far: 0.04533619610302071 n_components: 18
Best RMSE so far: 0.04040223527183198 n_components: 19
Best RMSE so far: 0.03421164798983782 n_components: 20
Best RMSE so far: 0.028484281349931348 n_components: 21
Best RMSE so far: 0.0212148377766173 n_components: 22
Best RMSE so far: 0.01432231482756502 n_components: 23
Best RMSE so far: 0.014315426984066145 n_components: 24
Best RMSE so far: 0.010019519476411552 n_components: 25
Best RMSE so far: 0.006352990503817964 n_components: 26
Best RMSE so far: 0.0031682146977330965 n_components: 28
Best RMSE so far: 4.941311530600969e-05 n_components: 29
Best RMSE so far: 5.0282295472848175e-06 n_components: 34
The best parameter is n_components = 34, with RMSE ≈ 5.03e-06.
Note that this is the reconstruction error on the training matrix, so larger component counts naturally score better.
7.7. Build optimal model¶
# rebuild the best model
nmf = NMF(n_components=34, init="random", random_state=0)
W_optimal = nmf.fit_transform(matrix)
H_optimal = nmf.components_
V_optimal = pd.DataFrame(W_optimal @ H_optimal, columns=matrix.columns)
V_optimal.index = matrix.index
V_optimal
| Product Category | Beauty and Health_0 | Beauty and Health_2500 | Beauty and Health_3200 | Books_0 | Books_2500 | Books_3200 | Clothing_0 | Clothing_2500 | Clothing_3200 | Electronics_0 | Electronics_2500 | Electronics_3200 | Home & Kitchen_0 | Home & Kitchen_2500 | Home & Kitchen_3200 | Other_0 | Other_2500 | Other_3200 | Pet Care_0 | Pet Care_2500 | Pet Care_3200 | Sports & Fitness_0 | Sports & Fitness_2500 | Sports & Fitness_3200 | Toys & Games_0 | Toys & Games_2500 | Toys & Games_3200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CID | |||||||||||||||||||||||||||
| 100009 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 1.334288e-11 | 0.000000e+00 | 0.000000e+00 | 7.435764e-08 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.113569e-17 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.657976e-10 | 0.000000e+00 | 0.000000e+00 | 1.473564e-11 | 0.000000e+00 | 1.573844e-11 | 0.000000e+00 | 0.0 |
| 100037 | 2.404599e-11 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 2.586357e-17 | 0.000000e+00 | 2.113218e-08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.400468e-11 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.0 |
| 100063 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.345025e-09 | 0.0 |
| 100089 | 1.434966e-10 | 1.235412e-13 | 1.000000 | 2.872113e-27 | 8.238349e-14 | 3.185483e-11 | 9.654683e-09 | 2.849260e-13 | 0.000000e+00 | 1.000000e+00 | 2.877394e-11 | 2.766075e-14 | 4.911494e-17 | 2.803937e-32 | 1.000000e+00 | 1.000000e+00 | 4.223620e-11 | 2.370130e-14 | 1.752165e-12 | 1.551780e-10 | 6.599479e-24 | 1.959135e-12 | 3.006646e-11 | 2.216866e-12 | 3.149742e-11 | 1.639871e-08 | 0.0 |
| 100096 | 0.000000e+00 | 0.000000e+00 | 0.000032 | 0.000000e+00 | 0.000000e+00 | 6.281709e-11 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 7.062256e-11 | 2.964797e-17 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 5.984497e-11 | 0.000000e+00 | 4.138161e-11 | 0.000000e+00 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999876 | 2.933893e-22 | 1.235412e-13 | 1.000000 | 1.290401e-27 | 8.238349e-14 | 6.044733e-14 | 9.654683e-09 | 2.849260e-13 | 7.375162e-13 | 0.000000e+00 | 0.000000e+00 | 1.211447e-14 | 3.647252e-29 | 3.447572e-30 | 3.571555e-18 | 1.300203e-14 | 2.414496e-24 | 0.000000e+00 | 1.735000e-12 | 6.667696e-13 | 1.107502e-12 | 1.887475e-12 | 1.229290e-13 | 0.000000e+00 | 9.563016e-14 | 1.841057e-11 | 0.0 |
| 999883 | 8.970466e-10 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 1.504884e-11 | 2.951377e-08 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.798758e-10 | 1.276449e-18 | 1.935277e-17 | 0.000000e+00 | 0.000000e+00 | 1.198114e-14 | 2.109816e-10 | 0.000000e+00 | 1.073013e-13 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.204492e-11 | 1.315675e-11 | 2.392920e-11 | 0.000000e+00 | 0.0 |
| 999910 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.457470e-17 | 8.236753e-18 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.002654e-12 | 3.013170e-12 | 0.000000e+00 | 0.000000e+00 | 0.0 |
| 999924 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 3.199588e-11 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.397037e-13 | 0.000000e+00 | 0.000000e+00 | 8.224176e-09 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.805197e-11 | 1.149997e-12 | 0.000000e+00 | 0.000000e+00 | 0.0 |
| 999996 | 0.000000e+00 | 0.000000e+00 | 0.000003 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.320539e-34 | 3.569029e-34 | 0.000000e+00 | 9.114352e-38 | 2.703203e-17 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.179782e-26 | 4.762485e-39 | 0.000000e+00 | 1.782310e-10 | 0.000000e+00 | 0.000000e+00 | 6.539824e-45 | 0.000000e+00 | 1.166545e-44 | 1.345025e-09 | 0.0 |
29071 rows × 27 columns
This matrix gives us a score for every customer and category/price bucket: the higher the value, the stronger the recommendation
7.8. Test recommendations for users¶
# print the top 5 recommendations for the first 10 users
for i in range(10):
print(f"Recommendations for user {V_optimal.index[i]}: {V_optimal.iloc[i].sort_values(ascending=False).head(5).index.tolist()}")
Recommendations for user 100009: ['Electronics_3200', 'Clothing_3200', 'Pet Care_2500', 'Toys & Games_0', 'Sports & Fitness_2500']
Recommendations for user 100037: ['Electronics_2500', 'Home & Kitchen_3200', 'Sports & Fitness_2500', 'Beauty and Health_0', 'Home & Kitchen_0']
Recommendations for user 100063: ['Books_0', 'Toys & Games_2500', 'Beauty and Health_0', 'Home & Kitchen_3200', 'Toys & Games_0']
Recommendations for user 100089: ['Home & Kitchen_3200', 'Beauty and Health_3200', 'Electronics_0', 'Other_0', 'Toys & Games_2500']
Recommendations for user 100096: ['Pet Care_3200', 'Beauty and Health_3200', 'Electronics_3200', 'Books_3200', 'Sports & Fitness_2500']
Recommendations for user 100097: ['Clothing_3200', 'Home & Kitchen_3200', 'Beauty and Health_3200', 'Sports & Fitness_2500', 'Books_3200']
Recommendations for user 100139: ['Electronics_3200', 'Clothing_3200', 'Pet Care_2500', 'Toys & Games_0', 'Sports & Fitness_2500']
Recommendations for user 100177: ['Other_3200', 'Home & Kitchen_3200', 'Toys & Games_0', 'Pet Care_2500', 'Home & Kitchen_0']
Recommendations for user 100193: ['Pet Care_2500', 'Home & Kitchen_0', 'Beauty and Health_3200', 'Pet Care_3200', 'Beauty and Health_0']
Recommendations for user 100205: ['Beauty and Health_0', 'Electronics_3200', 'Pet Care_0', 'Clothing_0', 'Electronics_0']
Above are the top 5 recommendations for the first 10 users.
The _0, _2500, and _3200 suffixes indicate the price range the user is likely to purchase in.
When a user logs in to the e-commerce platform, we can look up this matrix and recommend products from these categories and price ranges.
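The lookup step described above can be sketched as a small helper that returns the top-N column labels for a customer, optionally hiding buckets the customer already purchased. Toy score and purchase matrices are used here; in the notebook, V_optimal and the binary user-item matrix would play these roles:

```python
import pandas as pd

# Toy reconstructed score matrix for one customer (hypothetical values)
columns = ["Electronics_0", "Electronics_3200", "Books_0", "Clothing_2500"]
scores = pd.DataFrame([[0.1, 0.9, 0.7, 0.2]], index=[100009], columns=columns)
# Toy binary purchase matrix: this customer already bought Electronics_3200
purchased = pd.DataFrame([[0, 1, 0, 0]], index=[100009], columns=columns)

def top_n(cid, n=2, exclude_purchased=True):
    """Return the n highest-scoring category/price buckets for a customer."""
    row = scores.loc[cid]
    if exclude_purchased:
        row = row[purchased.loc[cid] == 0]  # hide already-bought buckets
    return row.sort_values(ascending=False).head(n).index.tolist()

recs = top_n(100009)
```

Whether to exclude already-purchased buckets is a design choice: for repeat-purchase categories (e.g. Beauty and Health) re-recommending can make sense, while for durable goods it usually does not.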
8. Conclusions¶
We analyzed the data of an e-commerce platform with 55000 entries (purchases)
We treated missing values, wrong data types, and negative values, and adjusted discounts for proper data analysis and modeling
We had to transform the data before applying any model in order to get meaningful insight into each segment
We implemented unsupervised models for the e-commerce platform
We used k-means clustering to identify 3 segments of users with different preferences and behaviors:
one-time buyers
casual buyers
recurrent buyers
We used Non-Negative Matrix Factorization to implement a recommender system that suggests the top categories for each user, along with the price range they are likely to purchase in.